Cleaning Up Very Large Databases and Keeping Them Clean
Abstract
This presentation shows a real-world example of how a very large customer database was cleansed and de-duplicated to shrink it to a manageable size. It covers the techniques used, as well as the processes that were implemented to maintain the new level of data cleanliness. These techniques are applicable to customer files or databases of any size in any business. Actual before-and-after data examples are shown.
Similar resources
Towards Automatic Detection of Erroneous Measurement Results in a Gravity Database
Geospatial databases often contain erroneous measurements. For some such databases such as gravity databases, the known methods of detecting erroneous measurements – based on regression analysis – do not work well. As a result, to clean such databases, experts use manual methods which are very time-consuming. In this paper, we propose a (natural) “localized” version of regression analysis as a ...
Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms
We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, N...
Street-cleaning Problems and Practices
The problem of keeping our cities and towns clean is far more complex and baffling than most of our citizens realize. Many of our city officials do not truly realize its magnitude: they are aware that so many street sweepers, truck drivers, and white wings are on the payroll; but without investigation they naturally assume that the best possible job is being done. Many of us have visited other ...
Soiled adhesive pads shear clean by slipping: a robust self-cleaning mechanism in climbing beetles.
Animals using adhesive pads to climb smooth surfaces face the problem of keeping their pads clean and functional. Here, a self-cleaning mechanism is proposed whereby soiled feet would slip on the surface due to a lack of adhesion but shed particles in return. Our study offers an in situ quantification of self-cleaning performance in fibrillar adhesives, using the dock beetle as a model organism...
Bayesian Data Cleaning for Web Data
Data Cleaning is a long standing problem, which is growing in importance with the mass of uncurated web data. State of the art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies (CFDs) to rectify data. These methods learn data patterns–CFDs–from a clean sample of the data and use them to rectify the dirty/inconsistent data. While getting...
Publication date: 2001